Abstract


In this paper, we propose the new fixed-size ordinally-forgetting encoding
(FOFE) method, which can almost uniquely encode any variable-length sequence
of words into a fixed-size representation. FOFE can model the word order in a
sequence using a simple ordinally-forgetting mechanism according to the
positions of words. In this work, we have applied FOFE to feedforward neural
network language models (FNN-LMs). Experimental results have shown that,
without using any recurrent feedback, FOFE-based FNN-LMs can significantly
outperform not only the standard fixed-input FNN-LMs but also the popular
recurrent neural network (RNN) LMs.


1 Introduction 


Language models play an important role in many applications like speech
recognition, machine translation, information retrieval and natural language
understanding. Traditionally, the back-off n-gram models [Katz, 1987; Kneser,
1995] are the standard approach to language modeling. Recently, neural
networks have been successfully applied to language modeling and have achieved
state-of-the-art performance in many tasks. In neural network language models
(NNLM), the feedforward neural networks (FNN) and recurrent neural networks
(RNN) [Elman, 1990] are two popular architectures. The basic idea of NNLMs is
to use a projection layer to project discrete words into a continuous space
and estimate word conditional probabilities in this space, which may be
smoother and thus generalize better to unseen contexts.


FNN language models (FNN-LM) [Bengio and Ducharme, 2001; Bengio, 2003] usually
use a limited history within a fixed-size context window to predict the next
word. RNN language models (RNN-LM) [Mikolov, 2010; Mikolov, 2012] adopt a
time-delayed recursive architecture for the hidden layers to memorize the
long-term dependency in language. Therefore, it is widely reported that
RNN-LMs usually outperform FNN-LMs in language modeling. While RNNs are
theoretically powerful, the learning of RNNs needs to use the so-called
back-propagation through time (BPTT) [Werbos, 1990] due to the internal
recurrent feedback cycles. BPTT significantly increases the computational
complexity of the learning algorithms and may cause many problems in
learning, such as gradient vanishing and exploding [Bengio, 1994]. More
recently, some new architectures have been proposed to solve these problems.
For example, the long short-term memory (LSTM) RNN [Hochreiter, 1997] is an
enhanced architecture that implements the recurrent feedback using various
learnable gates, and it has obtained promising results on handwriting
recognition [Graves, 2009] and sequence modeling [Graves, 2013]. Moreover, the
so-called temporal-kernel recurrent neural networks (TKRNN) [Sutskever, 2010]
have been proposed to handle the gradient vanishing problem. The main idea of
TKRNN is to add direct connections between units in all time steps, with every
unit implemented as an efficient leaky integrator, which makes it easier to
learn the long-term dependency. Along this line, a temporal-kernel model has
been successfully used for language modeling in [Shi, 2015].


Compared with RNN-LMs, FNN-LMs can be learned in a simpler and more efficient
way. However, FNN-LMs cannot model the long-term dependency in language due to
the fixed-size input window. In this paper, we propose a novel encoding method
for discrete sequences, named fixed-size ordinally-forgetting encoding (FOFE),
which can almost uniquely encode any variable-length word sequence into a
fixed-size code. Relying on a constant forgetting factor, FOFE can model the
word order in a sequence based on a simple ordinally-forgetting mechanism,
which uses the position of each word in the sequence. Both the theoretical
analysis and the experimental simulation have shown that FOFE can provide
almost unique codes for variable-length word sequences as long as the
forgetting factor is properly selected. In this work, we apply FOFE to neural
network language models, where the fixed-size FOFE codes are fed to FNNs as
input to predict the next word, enabling FNN-LMs to model long-term dependency
in language. Experiments on two benchmark tasks, Penn Treebank Corpus (PTB)
and Large Text Compression Benchmark (LTCB), have shown that FOFE-based
FNN-LMs can not only significantly outperform the standard fixed-input FNN-LMs
but also achieve better performance than the popular RNN-LMs with or without
using LSTM. Moreover, our implementation also shows that FOFE-based FNN-LMs
can be learned very efficiently on GPUs without the complex BPTT procedure.

2 Our Approach: FOFE 

Assume the vocabulary size is $K$. NNLMs adopt the 1-of-K encoding vectors as
input. In this case, each word in the vocabulary is represented as a one-hot
vector $e \in \mathbb{R}^K$. The 1-of-K representation is a
context-independent encoding method. When the 1-of-K representation is used to
model a word in a sequence, it cannot model its history or context.

2.1 Fixed-size Ordinally Forgetting Encoding 

We propose a simple context-dependent encoding method for any sequence
consisting of discrete symbols, namely fixed-size ordinally-forgetting
encoding (FOFE). Given a sequence of words (or any discrete symbols),
$S = \{w_1, w_2, \cdots, w_T\}$, each word $w_t$ is first represented by a
1-of-K representation $e_t$. From the first word ($t = 1$) to the end of the
sequence ($t = T$), FOFE encodes each partial sequence (history) based on a
simple recursive formula (with $z_0 = 0$):

$$z_t = \alpha \cdot z_{t-1} + e_t \quad (1 \le t \le T) \qquad (1)$$

where $z_t$ denotes the FOFE code for the partial sequence up to $w_t$, and
$\alpha$ ($0 < \alpha < 1$) is a constant forgetting factor that controls the
influence of the history on the current position.

Figure 1: The FOFE-based FNN language model.

Let us take a simple example. Assume we have three symbols in the vocabulary,
e.g., A, B, C, whose 1-of-K codes are [1, 0, 0], [0, 1, 0] and [0, 0, 1]
respectively. In this case, the FOFE code for the sequence {ABC} is
$[\alpha^2, \alpha, 1]$, and that of {ABCBC} is
$[\alpha^4, \alpha + \alpha^3, 1 + \alpha^2]$.
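The recursion in eq. (1) can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `fofe_encode` and the choice of $\alpha = 0.5$ in the demo are assumptions made here so that the results are exact in floating point.

```python
def fofe_encode(sequence, vocab, alpha):
    """Return the FOFE code z_T of `sequence` using z_t = alpha * z_{t-1} + e_t.

    `sequence` is any iterable of symbols, `vocab` lists the K symbols in a
    fixed order (hypothetical helper arguments, not from the paper).
    """
    index = {symbol: i for i, symbol in enumerate(vocab)}
    z = [0.0] * len(vocab)                    # z_0 = 0
    for symbol in sequence:
        z = [alpha * value for value in z]    # forget: scale the history by alpha
        z[index[symbol]] += 1.0               # add the one-hot vector e_t
    return z


if __name__ == "__main__":
    alpha = 0.5
    print(fofe_encode("ABC", "ABC", alpha))    # [0.25, 0.5, 1.0]
    print(fofe_encode("ABCBC", "ABC", alpha))  # [0.0625, 0.625, 1.25]
```

With $\alpha = 0.5$ the two printed codes match $[\alpha^2, \alpha, 1]$ and $[\alpha^4, \alpha + \alpha^3, 1 + \alpha^2]$ from the example.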

Obviously, FOFE can encode any variable-length discrete sequence into a
fixed-size code. Moreover, it is a recursive context-dependent encoding method
that models the order information through the various powers of the forgetting
factor. Furthermore, FOFE has an appealing property for modeling natural
languages: the far-away context is gradually forgotten because $\alpha < 1$,
and the nearby contexts play a much larger role in the resultant FOFE codes.

2.2 Uniqueness of FOFE codes 

Given the vocabulary (of $K$ symbols), for any sequence $S$ with a length of
$T$, based on the FOFE code $z_T$ computed as above, if we can always decode
the original sequence $S$ unambiguously (perfectly recovering $S$ from $z_T$),
we say FOFE is unique.

Figure 2: Numbers of collisions in simulation.

Theorem 1 If the forgetting factor $\alpha$ satisfies $0 < \alpha \le 0.5$,
FOFE is unique for any $K$ and $T$.

The proof is simple because if the FOFE code has a value $\alpha^t$ in its
$i$-th element, we may determine that the word $w_i$ occurs in position $t$ of
$S$ without ambiguity, since no matter how many times $w_i$ occurs in the
far-away contexts ($< t$), they do not sum to $\alpha^t$ (due to
$\alpha \le 0.5$). If $w_i$ appears in any closer context ($> t$), the $i$-th
element must be larger than $\alpha^t$.
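The argument behind Theorem 1 suggests a simple decoding procedure, sketched below under the assumption $0 < \alpha \le 1/2$. The most recent word is the only one whose element is at least 1, so it can be peeled off and the forgetting step undone. The names `fofe_encode` and `fofe_decode` are hypothetical, and exact rational arithmetic (`fractions.Fraction`) is used here only to keep the illustration free of rounding effects; it is not part of the original method.

```python
from fractions import Fraction


def fofe_encode(indices, vocab_size, alpha):
    """FOFE code of a sequence of word indices, computed with exact fractions."""
    z = [Fraction(0)] * vocab_size
    for i in indices:
        z = [alpha * v for v in z]   # z_t = alpha * z_{t-1} ...
        z[i] += 1                    # ... + e_t
    return z


def fofe_decode(code, alpha):
    """Recover the word indices from a FOFE code, assuming 0 < alpha <= 1/2."""
    z = list(code)
    reversed_words = []
    while any(v != 0 for v in z):
        # The latest word contributes exactly 1 to its element. All older
        # occurrences of any word sum to strictly less than 1 when
        # alpha <= 1/2, so exactly one element is >= 1.
        last = next(i for i, v in enumerate(z) if v >= 1)
        reversed_words.append(last)
        z[last] -= 1
        z = [v / alpha for v in z]   # undo one forgetting step
    return reversed_words[::-1]


if __name__ == "__main__":
    alpha = Fraction(1, 2)
    words = [0, 1, 2, 1, 2, 2, 0]
    print(fofe_decode(fofe_encode(words, 3, alpha), alpha) == words)
```

For $\alpha > 0.5$ this greedy step is no longer guaranteed to work, which is the case Theorem 2 addresses.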

For $0.5 < \alpha < 1$, we have the following theorem:

Theorem 2 For $0.5 < \alpha < 1$, given any finite values of $K$ and $T$, FOFE
is almost unique everywhere for $\alpha \in (0.5, 1.0)$, except only a finite
set of countable choices of $\alpha$.


The complete proof [Oguz, 2015] is given in Appendix A. Based on Theorem 2,
FOFE is unique almost everywhere in (0.5, 1.0), except for a countable set of
isolated choices of $\alpha$. In practice, the chance of choosing exactly one
of these isolated values in (0.5, 1.0) is extremely slim, realistically almost
impossible due to quantization errors in the system. To verify this, we have
run simulation experiments for all possible sequences up to $T = 20$ symbols
to count the number of collisions. A collision is counted when the maximum
element-wise difference between two FOFE codes (generated from two different
sequences) is less than a small threshold $\epsilon$. In Figure 2, we have
shown the number of collisions (out of the total $2^{20}$ tested cases) for
various $\alpha$ values when $\epsilon$ = 0.01, 0.001 and 0.0001 (see
footnote 1). The simulation experiments have shown that the chance of
collision is extremely small even when we allow a word to appear any number of
times in the context. Obviously, in a natural language, a word normally does
not appear repeatedly within a near context. Moreover, we have run the
simulation to examine whether collisions actually occur in two real text
corpora, namely PTB (1M words) and LTCB (160M words), using $\epsilon = 0.01$.
We have not observed a single collision for nine different $\alpha$ values
between [0.55, 1.0] (incremental 0.05).

1 When we use a bigger value for $\alpha$, the magnitudes of the resultant
FOFE codes become much larger. As a result, the number of collisions (as
measured by a fixed absolute threshold $\epsilon$) becomes smaller.
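A small-scale version of such a collision count might look like the sketch below. The setup is an assumption made for illustration and may differ from the authors' simulation: a two-symbol vocabulary, sequences of one fixed length, and an exhaustive comparison of all pairs. The names `fofe_code` and `count_collisions` are hypothetical.

```python
from itertools import product


def fofe_code(seq, vocab_size, alpha):
    """FOFE code of a sequence of word indices (floating point)."""
    z = [0.0] * vocab_size
    for i in seq:
        z = [alpha * v for v in z]
        z[i] += 1.0
    return z


def count_collisions(alpha, length, eps, vocab_size=2):
    """Count pairs of distinct sequences of the given length whose FOFE codes
    differ by less than `eps` in every element."""
    codes = [fofe_code(s, vocab_size, alpha)
             for s in product(range(vocab_size), repeat=length)]
    collisions = 0
    for a in range(len(codes)):
        for b in range(a + 1, len(codes)):
            max_diff = max(abs(x - y) for x, y in zip(codes[a], codes[b]))
            if max_diff < eps:
                collisions += 1
    return collisions


if __name__ == "__main__":
    # Sweep a few forgetting factors on short sequences (kept small because
    # the pairwise comparison is quadratic in the number of sequences).
    for alpha in (0.55, 0.65, 0.75, 0.85, 0.95):
        print(alpha, count_collisions(alpha, length=8, eps=0.01))
```

The pairwise loop is quadratic in the number of sequences, so a run at the scale reported above would need a cheaper comparison strategy, for example sorting or bucketing the codes.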


2.3 Implement FOFE for FNN-LMs 

The architecture of a FOFE based neural network language model (FOFE-FNNLM) is
shown in Figure 1. It is similar to standard bigram FNN-LMs except that it
uses a FOFE code as the input to the neural network LM at each time instance.
Moreover, FOFE can be easily scaled to other n-gram based neural network LMs.
For example, Figure 3 is an illustration of a fixed-size ordinally-forgetting
encoding based tri-gram neural network language model.

FOFE is a simple recursive encoding method, but a direct sequential
implementation may not be efficient on parallel computing platforms such as
GPUs. Here, we show that the FOFE computation can be efficiently implemented
as sentence-by-sentence matrix multiplications, which are particularly
suitable for the mini-batch based stochastic gradient descent (SGD) method
running on GPUs.

Given a sentence, $S = \{w_1, w_2, \cdots, w_T\}$, where each word is
represented by a 1-of-K code $e_t$ ($1 \le t \le T$), the FOFE codes for all
partial sequences in $S$ can be computed based on the following matrix
multiplication:

$$S = \begin{bmatrix} 1 & & & & \\ \alpha & 1 & & & \\ \alpha^2 & \alpha & 1 & & \\ \vdots & & \ddots & \ddots & \\ \alpha^{T-1} & \cdots & \alpha^2 & \alpha & 1 \end{bmatrix} \begin{bmatrix} e_1 \\ e_2 \\ e_3 \\ \vdots \\ e_T \end{bmatrix} = MV$$

where $V$ is a matrix arranging all 1-of-K codes of the words in the sentence
row by row, and $M$ is a $T$-th order lower triangular matrix. Each row vector
of $S$ represents a FOFE code of the partial sequence up to each position in
the sentence.
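The matrix form can be checked on a toy sentence with the sketch below. It assumes that entry $(i, j)$ of $M$ equals $\alpha^{i-j}$ for $j \le i$ and 0 otherwise, which is what unrolling the recursion in eq. (1) gives. Plain Python lists stand in for a real linear-algebra library, and all function names are hypothetical.

```python
def fofe_matrix(T, alpha):
    """T x T lower-triangular matrix with M[i][j] = alpha**(i - j) for j <= i."""
    return [[alpha ** (i - j) if j <= i else 0.0 for j in range(T)]
            for i in range(T)]


def one_hot_rows(indices, vocab_size):
    """Matrix V: one 1-of-K row per word of the sentence."""
    return [[1.0 if k == i else 0.0 for k in range(vocab_size)]
            for i in indices]


def matmul(A, B):
    """Naive dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def fofe_all_prefixes(indices, vocab_size, alpha):
    """Row t of the result is the FOFE code of the first t + 1 words."""
    V = one_hot_rows(indices, vocab_size)
    M = fofe_matrix(len(indices), alpha)
    return matmul(M, V)


if __name__ == "__main__":
    # Sentence A B C B C over the vocabulary {A, B, C}, with alpha = 0.5.
    for row in fofe_all_prefixes([0, 1, 2, 1, 2], 3, 0.5):
        print(row)
```

The last printed row is the code of the full sentence, and every earlier row is the code of a shorter history, so one multiplication yields the inputs for all positions at once.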

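This matrix formulation can be easily extended to a mini-batch consisting of
several sentences. Assume that a mini-batch is composed of $N$ sequences,
$C = \{S_1, S_2, \cdots, S_N\}$; we can compute the FOFE codes for all
sentences in the mini-batch as follows:

$$\bar{S} = \begin{bmatrix} M_1 & & & \\ & M_2 & & \\ & & \ddots & \\ & & & M_N \end{bmatrix} \begin{bmatrix} V_1 \\ V_2 \\ \vdots \\ V_N \end{bmatrix} = \bar{M}\bar{V}$$

where $M_n$ and $V_n$ are the matrices defined above for the $n$-th sentence
$S_n$.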
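One natural way to realize the mini-batch extension, assumed here for illustration, is to place the per-sentence matrices on the diagonal of one large matrix and to stack the one-hot rows of all sentences. The block-diagonal structure keeps the history of one sentence from leaking into the next. The helper names below are hypothetical.

```python
def fofe_matrix(T, alpha):
    """T x T lower-triangular matrix with M[i][j] = alpha**(i - j) for j <= i."""
    return [[alpha ** (i - j) if j <= i else 0.0 for j in range(T)]
            for i in range(T)]


def block_diag(blocks):
    """Place square blocks along the diagonal of an otherwise zero matrix."""
    n = sum(len(b) for b in blocks)
    out = [[0.0] * n for _ in range(n)]
    offset = 0
    for b in blocks:
        for i, row in enumerate(b):
            out[offset + i][offset:offset + len(row)] = row
        offset += len(b)
    return out


def matmul(A, B):
    """Naive dense matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def fofe_minibatch(sentences, vocab_size, alpha):
    """FOFE codes for every position of every sentence in the mini-batch."""
    M_bar = block_diag([fofe_matrix(len(s), alpha) for s in sentences])
    V_bar = [[1.0 if k == i else 0.0 for k in range(vocab_size)]
             for s in sentences for i in s]
    return matmul(M_bar, V_bar)


if __name__ == "__main__":
    # Two sentences over {A, B, C}: "A B" and "C B C", with alpha = 0.5.
    for row in fofe_minibatch([[0, 1], [2, 1, 2]], 3, 0.5):
        print(row)
```

In the demo, the third output row is the plain one-hot code of C, showing that the second sentence starts from an empty history.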
Figure 3: Illustration of a 2nd-order FOFE based FNN-LM.

When feeding the FOFE codes to the FNN as shown in Figure 1, we can compute
the activation signals (assume $f$ is the activation function) in the first
hidden layer for all histories in $S$ as follows:

$$H = f\big((MV)UW + b\big) = f\big(M(VU)W + b\big)$$

where $U$ denotes the word embedding matrix that projects the word indices
onto a low-dimensional continuous space. The product $VU$ can be computed
efficiently by looking up the embedding matrix. Therefore, for computational
efficiency, we may apply FOFE to the word embedding vectors instead of the
original high-dimensional one-hot vectors. In the backward pass, we can
calculate the gradients with the standard back-propagation (BP) algorithm
rather than BPTT. As a result, FOFE based FNN-LMs are the same as the standard
FNN-LMs in terms of computational complexity in training, which is much more
efficient than RNN-LMs.


3 Experiments


We have evaluated the FOFE method for NNLMs on two benchmark tasks: i) the
Penn Treebank (PTB) corpus of about 1M words, following the same setup as
[Mikolov, 2011]. The vocabulary size is limited to 10k. The preprocessing
method and the way to split data into training/validation/test sets are the
same as [Mikolov, 2011]. ii) The Large Text Compression Benchmark (LTCB)
[Mahoney, 2011]. In LTCB, we use the enwik9 dataset, which is composed of the
first $10^9$ bytes of enwiki-20060303-pages-articles.xml. We split it into
three parts: training (153M), validation (8.9M) and testing (8.9M) sets. We
limit the vocabulary size to 80k for LTCB and replace all out-of-vocabulary
words by a <UNK> token. Details of the two datasets can be found in Table 1.

Table 1: The size of PTB and LTCB corpora in words.

Corpus   train   valid   test
PTB      930k    74k     82k
LTCB     153M    8.9M    8.9M

3.1 Experimental results on PTB 


We have first evaluated the performance of the traditional FNN-LMs, taking the
previous several words as input, denoted as n-gram FNN-LMs here. We have
trained neural networks with a linear projection layer (of 200 hidden nodes)
and two hidden layers (of 400 nodes per layer). All hidden units in the
networks use the rectified linear activation function, i.e.,
$f(x) = \max(0, x)$. The nets are initialized based on the normalized
initialization in [Glorot, 2010], without using any pre-training. We use SGD
with a mini-batch size of 200 and an initial learning rate of 0.4. The
learning rate is kept fixed as long as the perplexity on the validation set
decreases by at least 1. After that, we continue six more epochs of training,
where the learning rate is halved after each epoch. The performance (in
perplexity) of various n-gram FNN-LMs is shown in Table 2.
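The learning-rate schedule described above can be sketched as follows. The interpretation is an assumption: once the validation perplexity improves by less than 1, the rate is halved before each of the six remaining epochs. The function name and its parameters are hypothetical, and the validation perplexities in the demo are made-up numbers used only to exercise the logic.

```python
def learning_rate_schedule(val_ppls, lr0=0.4, min_gain=1.0, extra_epochs=6):
    """Return the learning rate used in each epoch.

    `val_ppls[k]` is the validation perplexity measured after epoch k + 1.
    The rate stays at `lr0` while each epoch improves the perplexity by at
    least `min_gain`; afterwards it is halved before each of `extra_epochs`
    further epochs and training stops. If `val_ppls` runs out first, the last
    entry is the rate the next epoch would use.
    """
    rates = [lr0]                 # rate for the first epoch
    previous = float("inf")
    remaining = None              # None while the rate is still constant
    for ppl in val_ppls:
        if remaining is None and previous - ppl < min_gain:
            remaining = extra_epochs          # switch to the halving phase
        previous = ppl
        if remaining is None:
            rates.append(rates[-1])           # keep the rate fixed
        elif remaining > 0:
            rates.append(rates[-1] / 2)       # halve for the next epoch
            remaining -= 1
        else:
            break                             # all extra epochs are done
    return rates


if __name__ == "__main__":
    # Illustrative (made-up) validation perplexities, one per epoch.
    ppls = [200.0, 150.0, 149.5, 149.0, 148.0, 147.0, 146.0, 145.0, 144.0]
    print(learning_rate_schedule(ppls))
```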


2 Matlab codes are available at
https://wiki.eecs.yorku.ca/lab/MLL/projects:fofe:start for readers to
reproduce all results reported in this paper.

Figure 4: Perplexities of FOFE FNNLMs as a function of the forgetting factor.

Table 2: Perplexities on PTB for various LMs.

Model                      Test PPL
FNNLM [Mikolov, 2012]      141
KN 5-gram [Mikolov, 2011]  140
RNNLM [Mikolov, 2011]      123
LSTM [Graves, 2013]        117
bigram FNNLM               176
trigram FNNLM              131
4-gram FNNLM               118
5-gram FNNLM               114
6-gram FNNLM               113
1st-order FOFE-FNNLM       116
2nd-order FOFE-FNNLM       108

For the FOFE-FNNLMs, the net architecture and the parameter setting are the
same as above. The mini-batch size is also 200 and each mini-batch is composed
of several sentences up to 200 words (the last sentence may be truncated). All
sentences in the corpus are randomly shuffled at the beginning of each epoch.
In this experiment, we first investigate how the forgetting factor $\alpha$
may affect the performance of LMs. We have trained two FOFE-FNNLMs: i)
1st-order (using $z_t$ as input to the FNN for each time $t$); ii) 2nd-order
(using both $z_t$ and $z_{t-1}$ as input for each time $t$), with a forgetting
factor varying between [0.0, 1.0]. Experimental results in Figure 4 have shown
that a good choice of $\alpha$ lies between [0.5, 0.8]. Using a too large or
too small forgetting factor will hurt the performance. A too small forgetting
factor may limit the memory of the encoding, while a too large $\alpha$ may
confuse the LM with a far-away history. We set $\alpha = 0.7$ for the rest of
the experiments in this paper.
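The difference between the 1st-order and 2nd-order inputs can be illustrated with the sketch below. It assumes that the 2nd-order input at time $t$ is the concatenation of $z_t$ and $z_{t-1}$, with $z_0 = 0$ for the first position. The concatenation order and the function names are assumptions made for illustration.

```python
def fofe_prefix_codes(indices, vocab_size, alpha):
    """Return [z_1, ..., z_T], the FOFE codes of all histories of a sentence."""
    z = [0.0] * vocab_size
    codes = []
    for i in indices:
        z = [alpha * v for v in z]   # builds a fresh list at every step
        z[i] += 1.0
        codes.append(z)
    return codes


def second_order_inputs(indices, vocab_size, alpha):
    """Concatenate z_t with z_{t-1} for each position t (z_0 is all zeros)."""
    codes = fofe_prefix_codes(indices, vocab_size, alpha)
    previous = [[0.0] * vocab_size] + codes[:-1]
    return [current + older for current, older in zip(codes, previous)]


if __name__ == "__main__":
    # Sentence A B C over {A, B, C}; alpha = 0.5 is used for readability.
    for row in second_order_inputs([0, 1, 2], 3, 0.5):
        print(row)
```

Each 2nd-order input is twice as wide as the 1st-order one, which matches the larger [2*200] projection size listed for those models.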

In Table 2, we have summarized the perplexities on the PTB test set for
various models. The proposed FOFE-FNNLMs can significantly outperform the
baseline FNN-LMs using the same architecture. For example, the perplexity of
the baseline bigram FNNLM is 176, while the FOFE-FNNLM improves it to 116.
Moreover, the FOFE-FNNLMs can even overtake a well-trained RNNLM (400 hidden
units) in [Mikolov, 2011] and an LSTM in [Graves, 2013]. This indicates that
FOFE-FNNLMs can effectively model the long-term dependency in language without
using any recurrent feedback. Finally, the 2nd-order FOFE-FNNLM provides
further improvement, yielding a perplexity of 108 on PTB. It also outperforms
all higher-order FNN-LMs (4-gram, 5-gram and 6-gram), which are bigger in
model size. To our knowledge, this is one of the best reported results on PTB
without model combination.


3.2 Experimental results on LTCB 

We have further examined the FOFE based FNN-LMs on a much larger text corpus,
i.e. LTCB, which contains articles from Wikipedia. We have trained several
baseline systems: i) two n-gram LMs (3-gram and 5-gram) using the modified
Kneser-Ney smoothing without count cutoffs; ii) several traditional FNN-LMs
with different model sizes and input context windows (bigram, trigram, 4-gram
and 5-gram ones); iii) an RNN-LM with one hidden layer of 600 nodes using the
toolkit in [Mikolov, 2010], in which we have further used a spliced sentence
bunch in [Chen et al., 2014] to speed up the training on GPUs. Moreover, we
have examined four FOFE based FNN-LMs with various model sizes and input
window sizes (two 1st-order FOFE models and two 2nd-order ones). For all
NNLMs, we have used an output layer of the full vocabulary (80k words). In
these experiments, we have used an initial learning rate of 0.01, and a bigger
mini-batch of 500 for FNN-LMs and of 256 sentences for the RNN and FOFE
models. Experimental results in Table 3 have shown that the FOFE-based FNN-LMs
can significantly outperform the baseline FNN-LMs (including some larger
higher-order models) and also slightly overtake the popular RNN-based LM,
yielding the best result (perplexity of 107) on the test set.

Table 3: Perplexities on LTCB for various language models. [M*N] denotes the
sizes of the input context window and projection layer.

Model         Architecture           Test PPL
KN 3-gram     -                      156
KN 5-gram     -                      132
FNN-LM        [1*200]-400-400-80k    241
              [2*200]-400-400-80k    155
              [2*200]-600-600-80k    150
              [3*200]-400-400-80k    131
              [4*200]-400-400-80k    125
RNN-LM        [1*600]-600-80k        112
FOFE FNN-LM   [1*200]-400-400-80k    120
              [1*200]-600-600-80k    115
              [2*200]-400-400-80k    112
              [2*200]-600-600-80k    107


4 Conclusions 


In this paper, we propose the fixed-size ordinally-forgetting encoding (FOFE)
method to almost uniquely encode any variable-length sequence into a
fixed-size code. In this work, FOFE has been successfully applied to neural
network language modeling. Next, FOFE may be combined with neural networks
[Zhang and Jiang, 2015; Zhang et al., 2015] for other NLP tasks, such as
sentence modeling/matching, paraphrase detection, machine translation and
question answering.


Appendix A. The Proof of Theorem 2

Theorem 2: For $0.5 < \alpha < 1$, given any finite values of $K$ and $T$,
FOFE is almost unique everywhere for $\alpha \in (0.5, 1.0)$, except only a
finite set of countable choices of $\alpha$.

Proof: When we decode a given FOFE code of an unknown sequence $S$ (assume the
length of $S$ is not more than $T$), for any single value 1.0 in the $i$-th
position of the FOFE code, there are only two possible cases that may lead to
ambiguity in decoding: (i) word $w_i$ appears in the current location of $S$;
or (ii) word $w_i$ appears multiple times in the history of $S$ and the total
contribution of them happens to be 1.0. For case (ii) to happen, the
forgetting factor $\alpha$ needs to satisfy at least one of the following
polynomial equations:

$$\sum_{t=1}^{T} \xi_t \cdot \alpha^t = 1.0 \qquad (2)$$

where the above coefficients, $\xi_t$, are equal to either 1 or 0. If the word
$w_i$ appears in the $t$-th location ahead in the history, we have
$\xi_t = 1$. Otherwise, $\xi_t = 0$. We know that each equation in eq. (2) is
a $T$-th (or lower) order polynomial equation. It can have at most $T$ real
roots for $\alpha$. Moreover, since $\xi_t \in \{0, 1\}$, we can only have a
finite set of equations in eq. (2). The total number is not more than $2^T$.
Therefore, in total, we can only have a finite number of $\alpha$ values that
may satisfy at least one equation in eq. (2), i.e., at most $T \cdot 2^T$
possible roots [Oguz, 2015]. Among them, only a fraction of these roots lies
in (0.5, 1.0). Except for these countable choices of $\alpha$ values, eq. (2)
never holds for any other $\alpha$ values in (0.5, 1.0). As a result, case
(ii) never happens in decoding except at some isolated points of $\alpha$.
This proves that the resultant FOFE code is almost unique in (0.5, 1.0). ■
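As a worked instance of eq. (2), consider the smallest case in which a collision can arise, taking $T = 2$ and $\xi_1 = \xi_2 = 1$ as an illustrative assumption. The equation then has a single root in (0.5, 1.0):

```latex
% Worked instance of eq. (2), assuming T = 2 and \xi_1 = \xi_2 = 1
\alpha + \alpha^{2} = 1
\quad\Longrightarrow\quad
\alpha = \frac{\sqrt{5} - 1}{2} \approx 0.618 \in (0.5,\, 1.0)

% At this isolated value, two different sequences over the vocabulary {A, B}
% receive the same FOFE code:
z(\{AAB\}) = [\alpha^{2} + \alpha,\; 1] = [1,\; 1],
\qquad
z(\{BBA\}) = [1,\; \alpha^{2} + \alpha] = [1,\; 1]
```

Any other value of $\alpha$ in (0.5, 1.0) separates these two codes, which matches the isolated exceptions allowed by Theorem 2.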